Dynamic backup workers for parallel machine learning
Authors
Abstract
The most popular framework for distributed training of machine learning models is the (synchronous) parameter server (PS). This paradigm consists of n workers, which iteratively compute updates to the model parameters, and a stateful PS, which waits for and aggregates all updates to generate a new estimate of the model parameters and sends it back to the workers for the next iteration. Transient computation slowdowns or transmission delays can intolerably lengthen the time of each iteration. An efficient way to mitigate this problem is to let the PS wait only for the fastest n−b updates before generating the new parameters. The slowest b workers are called backup workers. The correct choice of the number of backup workers depends on the cluster configuration and workload, but also (as we show in this paper) on the hyper-parameters of the learning algorithm and the current stage of the training. We propose DBW, an algorithm that dynamically decides the number of backup workers during the training process to maximize the convergence speed. Our experiments show that DBW (1) removes the necessity to tune the number of backup workers by preliminary time-consuming experiments, and (2) makes the training up to a factor 3 faster than with the optimal static configuration.
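The backup-workers mechanism described above can be illustrated with a minimal sketch. This is not the paper's implementation: it simulates one synchronous PS iteration in which worker delays are given as plain numbers (in a real system the workers run in parallel and the PS collects updates over the network), and the function name, learning rate, and averaging rule are illustrative assumptions.

```python
def ps_step_with_backup_workers(params, n, b, compute_update, delays):
    """One synchronous PS iteration that waits only for the fastest
    n - b workers; the slowest b are 'backup workers' whose updates
    are discarded for this iteration."""
    # Pair each worker's (simulated) delay with its update.
    results = sorted(
        (delays[i], compute_update(params, i)) for i in range(n)
    )
    # Keep only the n - b fastest updates.
    fastest = [update for _, update in results[: n - b]]
    # Aggregate by averaging, as in synchronous SGD.
    avg = [sum(u[j] for u in fastest) / len(fastest)
           for j in range(len(params))]
    # Gradient-descent-style parameter update (illustrative rate 0.1).
    return [p - 0.1 * g for p, g in zip(params, avg)]
```

With b = 0 this degenerates to the standard synchronous PS, whose iteration time is dictated by the slowest worker; larger b shortens each iteration but aggregates fewer updates, which is exactly the trade-off DBW navigates dynamically.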
Similar resources
Structure-Aware Dynamic Scheduler for Parallel Machine Learning
Training large machine learning (ML) models with many variables or parameters can take a long time if one employs sequential procedures even with stochastic updates. A natural solution is to turn to distributed computing on a cluster; however, naive, unstructured parallelization of ML algorithms does not usually lead to a proportional speedup and can even result in divergence, because dependenc...
Parallel Computing for Machine Learning
My name is Xinlei Pan. I am a first-year graduate student in the bioengineering department. I'm interested in machine learning and its application in biological data analysis. Typical examples include gene expression data analysis to construct gene regulatory networks, EEG data analysis to study brain function, etc. For the machine learning part, I have a particular interest in probabilistic graph...
Two-stage fuzzy-stochastic programming for parallel machine scheduling problem with machine deterioration and operator learning effect
This paper deals with the determination of machine numbers and production schedules in manufacturing environments. Along this line, a two-stage fuzzy-stochastic programming model is discussed with fuzzy processing times, where both deterioration and learning effects are evaluated simultaneously. The first stage focuses on the type and number of machines in order to minimize the total costs associat...
Online Dynamic Value System for Machine Learning
A novel online dynamic value system for machine learning is proposed in this paper. The proposed system has a dual network structure: data processing network (DPN) and information evaluation network (IEN). The DPN is responsible for numerical data processing, including input space transformation and online dynamic data fitting. The IEN evaluates results provided by DPN. A dynamic three-curve fi...
Machine learning for dynamic incentive problems
We propose a generic method for solving infinite-horizon, discrete-time dynamic incentive problems with hidden states. We first combine set-valued dynamic programming techniques with Bayesian Gaussian mixture models to determine irregularly shaped equilibrium value correspondences. Second, we generate training data from those pre-computed feasible sets to recursively solve the dynamic incentive...
Journal
Journal title: Computer Networks
سال: 2021
ISSN: 1872-7069, 1389-1286
DOI: https://doi.org/10.1016/j.comnet.2021.107846